The survey focuses on evaluating large language models (LLMs), specifically their capabilities and their alignment with human values. The taxonomy and roadmap of the survey are as follows:

  1. Introduction: Introduces the concept of machine intelligence and the need for evaluating LLMs.

  2. Taxonomy and Roadmap: Presents a taxonomy framework for evaluating LLMs, which includes five fundamental domains: Knowledge and Capability Evaluation, Alignment Evaluation, Safety Evaluation, Specialized LLMs Evaluation, and Evaluation Organization.

  3. Knowledge and Capability Evaluation: Discusses the evaluation of LLMs’ knowledge and reasoning capabilities, including question answering, knowledge completion, reasoning (commonsense, logical, multi-hop, and mathematical), and tool learning.

  4. Alignment Evaluation: Focuses on evaluating the alignment of LLMs with human values, covering ethics and morality, bias detection, toxicity assessment, and truthfulness evaluation.

  5. Safety Evaluation: Explores the evaluation of LLMs’ robustness and the assessment of risks associated with their behaviors and potential misuse.

  6. Specialized LLMs Evaluation: Examines the evaluation of LLMs in specialized domains such as biology and medicine, education, legislation, computer science, and finance.

  7. Evaluation Organization: Provides an overview of existing benchmarks and evaluation methodologies for LLMs, including benchmarks for natural language understanding and generation, knowledge and reasoning, and holistic evaluation (see the scoring sketch after this list).

  8. Future Directions: Discusses future research directions, including risk evaluation, agent evaluation, dynamic evaluation, and enhancement-oriented evaluation for LLMs.

  9. Conclusion: Summarizes the main findings of the survey.
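To make the benchmark-style evaluation in item 7 concrete, below is a minimal sketch of normalized exact-match scoring, a metric commonly used for question-answering benchmarks. The normalization rules, function names, and sample data are illustrative assumptions for this sketch, not details taken from the survey itself.

```python
# Illustrative sketch: score model answers against gold references
# with normalized exact match, a common QA-benchmark metric.
# The normalization choices and sample data below are assumptions.

import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that match their reference after normalization."""
    matches = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return matches / len(references)


if __name__ == "__main__":
    preds = ["The Pacific Ocean", "paris", "1969"]
    golds = ["Pacific Ocean", "Paris", "1968"]
    print(f"Exact-match accuracy: {exact_match_accuracy(preds, golds):.2f}")  # 0.67
```

Real benchmark suites typically layer additional metrics (e.g., token-level F1 or model-graded judgments) on top of this kind of exact-match scoring, but the core loop of comparing model outputs to references is the same.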

The survey aims to provide a comprehensive overview of the current state of LLM evaluation research, examining both capabilities and alignment aspects. It expands on existing surveys by integrating insights across different evaluation categories and providing a more holistic characterization of LLM evaluation. The taxonomy framework helps structure the survey and allows readers to gain a nuanced understanding of LLM performance and challenges in diverse domains.